High-Confidence Off-Policy Evaluation
Authors
Abstract
Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, heretofore have not provided confidences regarding the accuracy of their estimates. In this paper we propose an off-policy method for computing a lower confidence bound on the expected return of a policy.
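To make the idea of a lower confidence bound from off-policy data concrete, here is a minimal sketch. It is not necessarily the method proposed in the paper, since the abstract does not say how the bound is constructed. The sketch combines ordinary trajectory-wise importance sampling with Hoeffding's inequality, a generic choice swapped in for illustration. The function names, the `(state, action, reward)` trajectory format, and the assumption that importance-weighted returns lie in `[0, upper]` are all hypothetical.

```python
import math


def importance_weighted_return(trajectory, pi_e, pi_b, gamma=1.0):
    """Importance-weighted discounted return of one trajectory.

    trajectory: list of (state, action, reward) tuples (assumed format).
    pi_e, pi_b: functions (state, action) -> action probability under the
                evaluation and behavior policies, respectively.
    """
    weight = 1.0
    ret = 0.0
    for t, (state, action, reward) in enumerate(trajectory):
        # Product of per-step probability ratios over the whole trajectory.
        weight *= pi_e(state, action) / pi_b(state, action)
        ret += (gamma ** t) * reward
    return weight * ret


def hoeffding_lower_bound(samples, upper, delta):
    """Lower bound on the mean that holds with probability >= 1 - delta.

    Assumes the samples are independent and each lies in [0, upper].
    """
    n = len(samples)
    mean = sum(samples) / n
    return mean - upper * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


# Toy one-step problem (illustrative data only): the behavior policy picks
# either action with probability 0.5; the evaluation policy prefers action 1.
def pi_b(state, action):
    return 0.5


def pi_e(state, action):
    return 0.8 if action == 1 else 0.2


trajectories = [[("s", 1, 1.0)], [("s", 0, 0.0)]] * 50
samples = [importance_weighted_return(tr, pi_e, pi_b) for tr in trajectories]

# Largest possible weighted return here: (0.8 / 0.5) * 1.0 = 1.6.
lower = hoeffding_lower_bound(samples, upper=1.6, delta=0.05)
print(f"estimate: {sum(samples) / len(samples):.3f}, 95% lower bound: {lower:.3f}")
```

Because the importance weights can be large, the range `upper` grows quickly with trajectory length, and a Hoeffding bound of this kind becomes loose. Bounds that are less sensitive to the range of the weighted returns are the usual remedy.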
Similar resources
Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation
In many reinforcement learning applications executing a poor policy may be costly or even dangerous. Thus, it is desirable to determine confidence interval lower bounds on the performance of any given policy without executing said policy. Current methods for high confidence off-policy evaluation require a substantial amount of data to achieve a tight lower bound, while existing model-based meth...
Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing
In this paper we consider the problem of evaluating one digital marketing policy (or more generally, a policy for an MDP with unknown transition and reward functions) using data collected from the execution of a different policy. We call this problem off-policy policy evaluation. Existing methods for off-policy policy evaluation assume that the transition and reward functions of the MDP are sta...
Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning
In the field of reinforcement learning there has been recent progress towards safety and high-confidence bounds on policy performance. However, to our knowledge, no methods exist for determining high-confidence safety bounds for a given evaluation policy in the inverse reinforcement learning setting—where the true reward function is unknown and only samples of expert behavior are given. We prop...
An Empirical Analysis of Off-policy Learning in Discrete MDPs
Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behaviour policy. While several methods are available for addressing off-policy evaluation, little work has been done on identifying the best methods. In this paper, we conduct an in-depth comparative study of several off-policy evaluation methods in non-bandit, finite-hor...
Q($\lambda$) with Off-Policy Corrections
We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided ...
Journal title:
Volume, issue:
Pages: -
Publication date: 2015